Introduction

The Charlottesville bars are looking for ways to increase their attendance next semester. One area they have selected is the playlist of songs for their bar: if they can find songs that people are more likely to dance to, this might increase their attendance and popularity. In this project we examine a dataset of songs and build machine learning models to predict a danceability rating for them. We use three model types: k-nearest neighbors (KNN), decision tree, and random forest. For each, we created an initial model and then tuned it to best predict our data. We aim to predict whether a particular song falls in the top 25% of danceability ratings and is therefore a song with optimal qualities to place on a playlist for nights at bars. Two metrics we focus on in this project are specificity and the F1 score. Specificity is important for this situation because songs classified as top-25% danceable when they are not could be detrimental to the atmosphere of a bar and could turn away bar-goers. The F1 score is also important in evaluating our models because the data is imbalanced.
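As a concrete reference for these two metrics, here is a minimal sketch computing them from confusion-matrix counts with "top25" as the positive class (the counts below are made up for illustration, not results from our models):

```r
# Specificity and F1 from confusion-matrix counts, treating "top25" as the
# positive class. The counts passed in below are hypothetical.
spec_and_f1 <- function(tp, fp, tn, fn) {
  specificity <- tn / (tn + fp)   # how well we avoid false "top25" calls
  precision   <- tp / (tp + fp)
  recall      <- tp / (tp + fn)   # a.k.a. sensitivity
  f1          <- 2 * precision * recall / (precision + recall)
  c(specificity = specificity, F1 = f1)
}
spec_and_f1(tp = 40, fp = 10, tn = 70, fn = 20)
# specificity = 0.875, F1 ~ 0.727
```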

Data Analysis

We obtained our dataset from Kaggle. It contains songs from the major genres along with various metrics about them, including acousticness, energy, key, loudness, tempo, and genre. Our models use these metrics to predict whether a song falls in the top 25% of danceability ratings. To prepare the dataset for the models, we first removed variables we decided not to use, either because they were identifying values such as song name and artist or because they contained erroneous data. For example, about half of the values in the song-duration column reported a duration of -1 milliseconds. We then removed NA values and converted the tempo column to numeric. We also normalized the popularity, loudness and tempo columns to be between 0 and 1 using a simple min-max scaler. Next, we collapsed the factors of the key column to combine sharp and natural notes of the same letter, and encoded the genre and mode columns as factors. We calculated the 75th percentile of the danceability scores and created a binary variable indicating whether a song was in the top 25%, replacing the original danceability column. Finally, we split the dataset into train, tune and test partitions for use with our models. Below are summary statistics for important variables in our data as well as a table showing our final cleaned data.

# load required libraries
library(C50)
library(caret)
library(class)
library(DT)
library(data.table)
library(MLmetrics)
library(mlbench)
library(mltools)
library(ROCR)
library(randomForest)
library(tidyverse)

music_genre_data <- read_csv("music_genre.csv")

# Removing identifier variables and mismanaged data columns (~50% of the duration variable indicated a song length of -1 ms)
music_genre = music_genre_data[-c(1,2,3,7,9,16)]

# convert '?' to NA
music_genre[music_genre == "?"] <- NA

# remove NAs
music_genre <- music_genre[complete.cases(music_genre),]

music_genre$tempo = as.numeric(music_genre$tempo)

normalize = function(x){
 (x - min(x)) / (max(x) - min(x))
}

# normalize popularity, loudness and tempo columns
music_genre[c(1,7,10)] = lapply(music_genre[c(1,7,10)], normalize)

# collapse factors of key column to group sharps and naturals
music_genre$key = fct_collapse(music_genre$key, 
                               A = c("A", "A#"),
                               B = c("B", "B#"),
                               C = c("C", "C#"),
                               D = c("D", "D#"),
                               E = c("E", "E#"),
                               F = c("F", "F#"),
                               G = c("G", "G#"))

# change "Hip-Hop" value in genre column to a usable R name
music_genre[music_genre == "Hip-Hop"] <- "HipHop"

# rename genre column from "music_genre" to "genre"
names(music_genre)[names(music_genre)=="music_genre"] = "genre" 

# change mode and music_genre columns to factor
music_genre$mode = as.factor(music_genre$mode) 
music_genre$genre = as.factor(music_genre$genre)

lapply(music_genre[c(5,8,12)], table)
## $key
## 
##    A    B    C    D    E    F    G 
## 7387 3398 9889 6146 3379 6660 8161 
## 
## $mode
## 
## Major Minor 
## 28874 16146 
## 
## $genre
## 
## Alternative       Anime       Blues   Classical     Country  Electronic 
##        4495        4497        4470        4500        4486        4466 
##      HipHop        Jazz         Rap        Rock 
##        4520        4521        4504        4561

Each of the factor variables is well balanced within the dataset.

Boxplot of Danceability:

# visualize the distribution of danceability scores
boxplot(music_genre$danceability)

danceabilitySummary <- summary(music_genre$danceability)
danceabilitySummary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0596  0.4420  0.5680  0.5585  0.6870  0.9860
print("The 75th percentile of danceability is .687; songs above this value are in the top 25%")
## [1] "The 75th percentile of danceability is .687; songs above this value are in the top 25%"
music_genre$danceability = (ifelse(music_genre$danceability > danceabilitySummary[5], 1, 0))
music_genre$danceability = fct_collapse(as.factor(music_genre$danceability), "bottom75" = "0", "top25" = "1")

# split up data into train, tune and test
set.seed(3001)
part1_indexing = createDataPartition(music_genre$danceability,
                                     times = 1,
                                     p = 0.70,
                                     groups=1,
                                     list=FALSE)

train = music_genre[part1_indexing,]
tune_and_test = music_genre[-part1_indexing,]

tune_and_test_index = createDataPartition(tune_and_test$danceability,
                                          p = .5,
                                          list = FALSE,
                                          times = 1)

tune = tune_and_test[tune_and_test_index,]
test = tune_and_test[-tune_and_test_index,]

# show nicely formatted table of cleaned data
datatable(music_genre) 

Our Models

KNN

25 Neighbors

For our KNN model, we started with 25 nearest neighbors and all variables except key and genre, one-hot encoding the remaining factors. This model performed well on the negative class, which, fortunately, is the class we are more interested in correctly predicting. However, at a sensitivity of only about 46%, we would miss roughly one of every two songs that would be “danceable”, and that may be too low for the bars.

KNN_train = train[-c(3,5,12)]   # drop danceability (the label), key and genre
train1h = one_hot(as.data.table(KNN_train), cols = "auto", sparsifyNAs = TRUE,
                  naCols = TRUE, dropCols = TRUE, dropUnusedLevels = TRUE)
KNN_tune = tune[-c(3,5,12)]
tune1h = one_hot(as.data.table(KNN_tune), cols = "auto", sparsifyNAs = TRUE,
                 naCols = TRUE, dropCols = TRUE, dropUnusedLevels = TRUE)

Music_25NN = knn(train = train1h,
                 test = tune1h,
                 cl = train$danceability,
                 k = 25,
                 use.all = TRUE,
                 prob = TRUE)

confusionMatrix(as.factor(Music_25NN), as.factor(tune$danceability), positive = "top25", dnn=c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction bottom75 top25
##   bottom75     4733   918
##   top25         334   768
##                                           
##                Accuracy : 0.8146          
##                  95% CI : (0.8051, 0.8238)
##     No Information Rate : 0.7503          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4405          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4555          
##             Specificity : 0.9341          
##          Pos Pred Value : 0.6969          
##          Neg Pred Value : 0.8376          
##              Prevalence : 0.2497          
##          Detection Rate : 0.1137          
##    Detection Prevalence : 0.1632          
##       Balanced Accuracy : 0.6948          
##                                           
##        'Positive' Class : top25           
## 
#Establishing dataframe with both the prediction and probability
Music_25NN_Prob = data.frame(pred = as_factor(Music_25NN), prob = attr(Music_25NN, "prob"))
#Adjusting so probability aligns with probability of "top25" for all observations
Music_25NN_Prob$prob = ifelse(Music_25NN_Prob$pred == "bottom75", 1 - Music_25NN_Prob$prob, Music_25NN_Prob$prob)
#Finding F1 score at .5 threshold
pred_5_25NN = as_factor(ifelse(Music_25NN_Prob$prob > 0.5, "top25", "bottom75"))
print(paste("F-1 Score at a .5 threshold:",F1_Score(y_pred = pred_5_25NN, y_true = as_factor(tune$danceability), positive = "top25")))
## [1] "F-1 Score at a .5 threshold: 0.549964054636952"

We assumed popularity, energy, liveness, loudness, and tempo would best align with danceability, so we attempted to feature-engineer the KNN model by using only these variables. However, this led to major decreases in both sensitivity and accuracy.

#Selecting the variables we predicted would most closely correlate with danceability (popularity, energy, liveness, loudness, and tempo)
Music_25NN_tuned = knn(train = train[c(1,4,6,7,10)],
                test = tune[c(1,4,6,7,10)],
                cl = train$danceability,
                k = 25,
                use.all = TRUE,
                prob = TRUE)
confusionMatrix(as.factor(Music_25NN_tuned), as.factor(tune$danceability), positive = "top25", dnn=c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction bottom75 top25
##   bottom75     4641  1186
##   top25         426   500
##                                           
##                Accuracy : 0.7613          
##                  95% CI : (0.7509, 0.7714)
##     No Information Rate : 0.7503          
##     P-Value [Acc > NIR] : 0.01902         
##                                           
##                   Kappa : 0.2501          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.29656         
##             Specificity : 0.91593         
##          Pos Pred Value : 0.53996         
##          Neg Pred Value : 0.79646         
##              Prevalence : 0.24967         
##          Detection Rate : 0.07404         
##    Detection Prevalence : 0.13712         
##       Balanced Accuracy : 0.60624         
##                                           
##        'Positive' Class : top25           
## 
#Worsened accuracy and, especially, sensitivity

100 Neighbors

When we adjusted to 100 neighbors, specificity marginally increased while sensitivity dropped; this likely occurred because higher values of k favor the more prevalent class.
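To illustrate why, consider the extreme case where k approaches the size of the training set: every neighborhood then contains mostly majority-class points, so the vote always tips toward the prevalent class. A toy sketch on synthetic data (not our dataset):

```r
# Toy illustration: with a 75/25 class imbalance and k = 99 of 100 training
# points, every point's neighborhood is majority "bottom75", so knn never
# predicts the minority class at all.
library(class)
set.seed(1)
x <- data.frame(v = c(rnorm(75, mean = 0), rnorm(25, mean = 2)))
y <- factor(rep(c("bottom75", "top25"), times = c(75, 25)))
table(knn(train = x, test = x, cl = y, k = 99))
# bottom75: 100, top25: 0
```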

#Sensitivity drops a lot for a small improvement in specificity (likely due more to the imbalanced data than to the model predicting the observations better)
Music_100NN = knn(train = train1h,
                test = tune1h,
                cl = train$danceability,
                k = 100,
                use.all = TRUE,
                prob = TRUE)
confusionMatrix(as.factor(Music_100NN), as.factor(tune$danceability), positive = "top25", dnn=c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction bottom75 top25
##   bottom75     4783   984
##   top25         284   702
##                                           
##                Accuracy : 0.8122          
##                  95% CI : (0.8027, 0.8215)
##     No Information Rate : 0.7503          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4183          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4164          
##             Specificity : 0.9440          
##          Pos Pred Value : 0.7120          
##          Neg Pred Value : 0.8294          
##              Prevalence : 0.2497          
##          Detection Rate : 0.1040          
##    Detection Prevalence : 0.1460          
##       Balanced Accuracy : 0.6802          
##                                           
##        'Positive' Class : top25           
## 
#Establishing dataframe with both the prediction and probability
Music_100NN_Prob = data.frame(pred = as_factor(Music_100NN), prob = attr(Music_100NN, "prob"))
#Adjusting so probability aligns with probability of "top25" for all observations
Music_100NN_Prob$prob = ifelse(Music_100NN_Prob$pred == "bottom75", 1 - Music_100NN_Prob$prob, Music_100NN_Prob$prob)
#Finding F1 score at .5 threshold
pred_5_100NN = as_factor(ifelse(Music_100NN_Prob$prob > 0.5, "top25", "bottom75"))
print(paste("F-1 Score at a .5 threshold:",F1_Score(y_pred = pred_5_100NN, y_true = as_factor(tune$danceability), positive = "top25")))
## [1] "F-1 Score at a .5 threshold: 0.520723436322532"
Music_100NN_tuned = knn(train = train[c(1,4,6,7,10)],
                test = tune[c(1,4,6,7,10)],
                cl = train$danceability,
                k = 100,
                use.all = TRUE,
                prob = TRUE)
confusionMatrix(as.factor(Music_100NN_tuned), as.factor(tune$danceability), positive = "top25", dnn=c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction bottom75 top25
##   bottom75     4639  1187
##   top25         428   499
##                                          
##                Accuracy : 0.7608         
##                  95% CI : (0.7505, 0.771)
##     No Information Rate : 0.7503         
##     P-Value [Acc > NIR] : 0.02334        
##                                          
##                   Kappa : 0.2489         
##                                          
##  Mcnemar's Test P-Value : < 2e-16        
##                                          
##             Sensitivity : 0.29597        
##             Specificity : 0.91553        
##          Pos Pred Value : 0.53830        
##          Neg Pred Value : 0.79626        
##              Prevalence : 0.24967        
##          Detection Rate : 0.07389        
##    Detection Prevalence : 0.13727        
##       Balanced Accuracy : 0.60575        
##                                          
##        'Positive' Class : top25          
## 

5 Neighbors

At only 5 neighbors, specificity and accuracy decreased: with so few neighbors, each prediction is more sensitive to noise in the data. Overall, our first model of 25 neighbors with all variables likely best fit our business question.

Music_5NN = knn(train = train1h,
                test = tune1h,
                cl = train$danceability,
                k = 5,
                use.all = TRUE,
                prob = TRUE)
confusionMatrix(as.factor(Music_5NN), as.factor(tune$danceability), positive = "top25", dnn=c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction bottom75 top25
##   bottom75     4555   857
##   top25         512   829
##                                           
##                Accuracy : 0.7973          
##                  95% CI : (0.7875, 0.8068)
##     No Information Rate : 0.7503          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4193          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4917          
##             Specificity : 0.8990          
##          Pos Pred Value : 0.6182          
##          Neg Pred Value : 0.8416          
##              Prevalence : 0.2497          
##          Detection Rate : 0.1228          
##    Detection Prevalence : 0.1986          
##       Balanced Accuracy : 0.6953          
##                                           
##        'Positive' Class : top25           
## 
#Slightly increased sensitivity, but not enough and at the expense of accuracy

#Establishing dataframe with both the prediction and probability
Music_5NN_Prob = data.frame(pred = as_factor(Music_5NN), prob = attr(Music_5NN, "prob"))
#Adjusting so probability aligns with probability of "top25" for all observations
Music_5NN_Prob$prob = ifelse(Music_5NN_Prob$pred == "bottom75", 1 - Music_5NN_Prob$prob, Music_5NN_Prob$prob)
#Finding F1 score at .5 threshold
pred_5_5NN = as_factor(ifelse(Music_5NN_Prob$prob > 0.5, "top25", "bottom75"))
print(paste("F-1 Score at a .5 threshold:",F1_Score(y_pred = pred_5_5NN, y_true = as_factor(tune$danceability), positive = "top25")))
## [1] "F-1 Score at a .5 threshold: 0.548108825481088"
Music_5NN_tuned = knn(train = train[c(1,4,6,7,10)],
                test = tune[c(1,4,6,7,10)],
                cl = train$danceability,
                k = 5,
                use.all = TRUE,
                prob = TRUE)
confusionMatrix(as.factor(Music_5NN_tuned), as.factor(tune$danceability), positive = "top25", dnn=c("Prediction", "Actual"), mode = "sens_spec")
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction bottom75 top25
##   bottom75     4423  1056
##   top25         644   630
##                                           
##                Accuracy : 0.7483          
##                  95% CI : (0.7377, 0.7586)
##     No Information Rate : 0.7503          
##     P-Value [Acc > NIR] : 0.659           
##                                           
##                   Kappa : 0.2685          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.37367         
##             Specificity : 0.87290         
##          Pos Pred Value : 0.49451         
##          Neg Pred Value : 0.80726         
##              Prevalence : 0.24967         
##          Detection Rate : 0.09329         
##    Detection Prevalence : 0.18866         
##       Balanced Accuracy : 0.62328         
##                                           
##        'Positive' Class : top25           
## 
#Best model is likely the first one made (the increases in specificity from the other models were not worth the decreases in sensitivity)
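Rather than hand-picking k values, a small sweep over candidate k's makes the specificity comparison systematic. A minimal sketch on synthetic stand-in data (in our report, `train1h`/`tune1h` and the danceability labels would take the place of the toy inputs):

```r
# Sweep candidate k values and report tune-set specificity for each,
# treating "top25" as the positive class. Toy data stands in for our
# one-hot-encoded train/tune sets.
library(class)
sweep_k <- function(train_x, test_x, cl, truth, ks) {
  sapply(ks, function(k) {
    pred <- knn(train = train_x, test = test_x, cl = cl, k = k)
    tn <- sum(pred == "bottom75" & truth == "bottom75")
    fp <- sum(pred == "top25"    & truth == "bottom75")
    tn / (tn + fp)   # specificity
  })
}

set.seed(3001)
x <- data.frame(a = rnorm(200), b = rnorm(200))
y <- factor(ifelse(x$a + x$b + rnorm(200, sd = 0.5) > 0, "top25", "bottom75"))
sweep_k(x[1:150, ], x[151:200, ], cl = y[1:150], truth = y[151:200],
        ks = c(5, 25, 100))
```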

Decision Tree

For our decision tree, we first cross-validated a C5.0 model to find the ideal number of boosting iterations along with whether or not it should winnow (remove) variables of low importance. Ultimately, we used 20 boosting iterations and no winnowing, but this process was very computationally expensive.
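The cross-validation step was not echoed above because of its runtime; a sketch of the setup might look like the following (the grid values are assumptions, and the `train()` call is left commented out because it is very slow):

```r
# Hypothetical reconstruction of the C5.0 cross-validation grid; the candidate
# values here are assumptions, not the exact grid we ran. caret's "C5.0"
# method tunes trials (boosting iterations), model, and winnow.
c5_grid <- expand.grid(trials = c(1, 5, 10, 20),  # boosting iterations
                       model  = "tree",
                       winnow = c(TRUE, FALSE))   # drop low-importance predictors?
# cv_c5 <- caret::train(danceability ~ ., data = train,
#                       method = "C5.0", metric = "Kappa",
#                       trControl = caret::trainControl(method = "cv", number = 5),
#                       tuneGrid = c5_grid)
# cv_c5$bestTune   # we ultimately used trials = 20, winnow = FALSE
```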

Our initial model performed better than the KNN models, but we can still improve through tuning. We adjusted two hyperparameters: (1) the minimum number of cases in each leaf node, and (2) the confidence factor (the threshold of error allowed in the data; the higher the number, the less pruning in the model).

c5_model = C5.0(danceability~.,
                data = train,
                trials = 20,
                control = C5.0Control(winnow = FALSE,
                                      minCases = 500))

varImp(c5_model)
##              Overall
## loudness      100.00
## tempo         100.00
## genre         100.00
## valence        98.29
## acousticness   92.05
## energy         89.98
## liveness       89.98
## speechiness    89.44
## popularity     48.89
## mode           36.47
## key            15.58
plot(c5_model)

The first split is on genre; intuitively, this makes sense given the splits that follow: rap and hip-hop are most danceable when they have high energy and fast tempo, while other genres may be better suited to slow dancing, which works best with low energy and slow tempo.

dance_prob = as_tibble(predict(c5_model, tune, type = "prob"))
dance_pred = predict(c5_model, tune, type = "class")
confusionMatrix(as.factor(dance_pred),
                as.factor(tune$danceability),
                dnn = c("Predicted", "Actual"),
                mode = "sens_spec",
                positive = "top25")
## Confusion Matrix and Statistics
## 
##           Actual
## Predicted  bottom75 top25
##   bottom75     4781   769
##   top25         286   917
##                                           
##                Accuracy : 0.8438          
##                  95% CI : (0.8349, 0.8524)
##     No Information Rate : 0.7503          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.539           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5439          
##             Specificity : 0.9436          
##          Pos Pred Value : 0.7623          
##          Neg Pred Value : 0.8614          
##              Prevalence : 0.2497          
##          Detection Rate : 0.1358          
##    Detection Prevalence : 0.1781          
##       Balanced Accuracy : 0.7437          
##                                           
##        'Positive' Class : top25           
## 
#F1 Score at .5 threshold
pred_5 = as_factor(ifelse(dance_prob$top25 > 0.5, "top25", "bottom75"))
print(paste("F-1 Score at a .5 threshold:",F1_Score(y_pred = pred_5, y_true = as_factor(tune$danceability), positive = "top25")))
## [1] "F-1 Score at a .5 threshold: 0.634821737625476"

Tuning

Number of Cases

ggplot(empty, aes(x = reorder(as.factor(min_cases), -F1), y = F1)) +
  geom_col(width = .8) +
  geom_bar(data = subset(empty, min_cases == 50),
           aes(as.factor(min_cases), F1),
           fill = "green", stat = "identity")

ggplot(empty, aes(x = reorder(as.factor(min_cases), -specificity), y = specificity)) +
  geom_col(width = .8) +
  geom_bar(data = subset(empty, min_cases == 50),
           aes(as.factor(min_cases), specificity),
           fill = "green", stat = "identity")

#50 seems to be the best compromise
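The `empty` results table used in the plots above came from a tuning chunk that was not echoed; a sketch of how it could be built (an assumed reconstruction, using the libraries loaded earlier and our `train`/`tune` splits) is:

```r
# Assumed reconstruction of the un-echoed tuning loop behind `empty`: fit one
# C5.0 model per candidate minCases value and record tune-set specificity
# and F1. Relies on C50, caret and MLmetrics being loaded, as above.
tune_min_cases <- function(cases_to_try) {
  out <- data.frame(min_cases = cases_to_try,
                    specificity = NA_real_, F1 = NA_real_)
  for (i in seq_along(cases_to_try)) {
    mdl  <- C5.0(danceability ~ ., data = train, trials = 20,
                 control = C5.0Control(winnow = FALSE,
                                       minCases = cases_to_try[i]))
    pred <- predict(mdl, tune, type = "class")
    cm   <- confusionMatrix(pred, tune$danceability, positive = "top25")
    out$specificity[i] <- cm$byClass["Specificity"]
    out$F1[i] <- F1_Score(y_pred = pred, y_true = tune$danceability,
                          positive = "top25")
  }
  out
}
# empty <- tune_min_cases(c(2, 10, 50, 100, 500))  # candidate values are assumptions
# `empty2` is built analogously, looping over CF values in C5.0Control() instead
```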

Confidence Factor

ggplot(empty2, aes(x = reorder(as.factor(CF_level), -F1), y = F1)) +
  geom_col(width = .8) +
  geom_bar(data = subset(empty2, CF_level == .9),
           aes(as.factor(CF_level), F1),
           fill = "green", stat = "identity")

ggplot(empty2, aes(x = reorder(as.factor(CF_level), -specificity), y = specificity)) +
  geom_col(width = .8) +
  geom_bar(data = subset(empty2, CF_level == .9),
           aes(as.factor(CF_level), specificity),
           fill = "green", stat = "identity")

Final Model Against Tune Data

We found a minimum of 50 cases per leaf and a confidence factor of .9 to be the best compromise between F1 score and specificity. The final model marginally decreased in specificity but greatly increased in sensitivity and F1.

c5_model_tune = C5.0(danceability~.,
                     data = train,
                     trials = 20,
                     control = C5.0Control(winnow = FALSE,
                                           minCases = 50,
                                           CF = .9))
dance_prob_tune = as_tibble(predict(c5_model_tune, tune, type = "prob"))
dance_pred_tune = predict(c5_model_tune, tune, type = "class")
confusionMatrix(as.factor(dance_pred_tune),
                as.factor(tune$danceability),
                dnn = c("Predicted", "Actual"),
                mode = "sens_spec",
                positive = "top25")
## Confusion Matrix and Statistics
## 
##           Actual
## Predicted  bottom75 top25
##   bottom75     4720   631
##   top25         347  1055
##                                           
##                Accuracy : 0.8552          
##                  95% CI : (0.8466, 0.8635)
##     No Information Rate : 0.7503          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5904          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6257          
##             Specificity : 0.9315          
##          Pos Pred Value : 0.7525          
##          Neg Pred Value : 0.8821          
##              Prevalence : 0.2497          
##          Detection Rate : 0.1562          
##    Detection Prevalence : 0.2076          
##       Balanced Accuracy : 0.7786          
##                                           
##        'Positive' Class : top25           
## 
#F1 Score at .5 threshold
pred_5_tune = as_factor(ifelse(dance_prob_tune$top25 > 0.5, "top25", "bottom75"))
print(paste("F-1 Score at a .5 threshold:",F1_Score(y_pred = pred_5_tune, y_true = as_factor(tune$danceability), positive = "top25")))
## [1] "F-1 Score at a .5 threshold: 0.683290155440415"
#Best model so far; an ROC curve can help evaluate alternative thresholds
dance_prob_tune = as_tibble(dance_prob_tune)
dance_eval_tune = tibble(pred_class=dance_pred_tune, pred_prob=dance_prob_tune$top25,target=as.numeric(tune$danceability))
pred = prediction(dance_eval_tune$pred_prob, dance_eval_tune$target)
#Choosing to evaluate the true positive rate and false positive rate based on the threshold
ROC_curve = performance(pred,"tpr","fpr")
plot(ROC_curve, colorize=TRUE)
abline(a=0, b= 1)

tree_perf_AUC = performance(pred,"auc")
print(paste("AUC =",tree_perf_AUC@y.values))
## [1] "AUC = 0.906414075118206"
#.5 is near the elbow of the curve (worth trying .4 and .6)

Thresholding

@ .4

dance_pred_4 = as_factor(ifelse(dance_prob_tune$top25 > 0.4, "top25", "bottom75"))
confusionMatrix(as.factor(dance_pred_4),
                as.factor(tune$danceability),
                dnn = c("Predicted", "Actual"),
                mode = "sens_spec",
                positive = "top25")
## Confusion Matrix and Statistics
## 
##           Actual
## Predicted  bottom75 top25
##   bottom75     4592   565
##   top25         475  1121
##                                           
##                Accuracy : 0.846           
##                  95% CI : (0.8372, 0.8545)
##     No Information Rate : 0.7503          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5815          
##                                           
##  Mcnemar's Test P-Value : 0.005784        
##                                           
##             Sensitivity : 0.6649          
##             Specificity : 0.9063          
##          Pos Pred Value : 0.7024          
##          Neg Pred Value : 0.8904          
##              Prevalence : 0.2497          
##          Detection Rate : 0.1660          
##    Detection Prevalence : 0.2363          
##       Balanced Accuracy : 0.7856          
##                                           
##        'Positive' Class : top25           
## 

The increases in sensitivity are not worth the decreases in specificity, since specificity is our metric of interest.

@ .6

dance_pred_6 = as_factor(ifelse(dance_prob_tune$top25 > 0.6, "top25", "bottom75"))
confusionMatrix(as.factor(dance_pred_6),
                as.factor(tune$danceability),
                dnn = c("Predicted", "Actual"),
                mode = "sens_spec",
                positive = "top25")
## Confusion Matrix and Statistics
## 
##           Actual
## Predicted  bottom75 top25
##   bottom75     4897   987
##   top25         170   699
##                                           
##                Accuracy : 0.8287          
##                  95% CI : (0.8195, 0.8376)
##     No Information Rate : 0.7503          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4545          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4146          
##             Specificity : 0.9664          
##          Pos Pred Value : 0.8044          
##          Neg Pred Value : 0.8323          
##              Prevalence : 0.2497          
##          Detection Rate : 0.1035          
##    Detection Prevalence : 0.1287          
##       Balanced Accuracy : 0.6905          
##                                           
##        'Positive' Class : top25           
## 

The improvements in specificity are not worth the great losses in sensitivity.

Random Forest

# function to calculate the mtry level (square root of the number of predictor variables)
mytry_tune <- function(x){
  xx <- dim(x)[2]-1
  sqrt(xx)
}
set.seed(3001)
rfInit = randomForest(danceability~.,   #<- formula: response ~ predictors; "." means all other variables
                      train,            #<- data frame with the variables to be used
                      ntree = 500,      #<- number of trees to grow; should not be too small, so every input row gets classified at least a few times
                      mtry = mytry_tune(music_genre), #<- variables randomly sampled at each split; default for classification is sqrt(# of predictors)
                      replace = TRUE,   #<- sample data points with replacement
                      sampsize = 3000,  #<- size of the sample drawn for each tree
                      nodesize = 5,     #<- minimum number of data points in terminal nodes
                      importance = TRUE,  #<- assess importance of predictors
                      proximity = FALSE,  #<- skip the row-proximity matrix
                      norm.votes = TRUE,  #<- express final votes as fractions
                      do.trace = TRUE,    #<- verbose output while running
                      keep.forest = TRUE, #<- retain the forest in the output object
                      keep.inbag = TRUE)  #<- track which samples are in-bag in each tree

# function to show call, variable importance and confusion matrix given a model
showModelOutput <- function(mdl, modelName) {
  print("Call")
  print(mdl$call)
  print("Variable Importance")
  print(mdl$importance)
  varImpPlot(mdl, main=modelName)
  plot(mdl, main=modelName)
  mdlPredict = predict(mdl,
                       tune,
                       type = "response",
                       predict.all = FALSE,
                       proximity = FALSE)
  confusionMatrix(as.factor(mdlPredict),
                as.factor(tune$danceability),
                dnn = c("Predicted", "Actual"),
                mode = "sens_spec",
                positive = "top25")
}

# function to output an F-1 score given a model
f1score = function(mdl){
  mdlPredictprob = as_tibble(predict(mdl,
                    tune,
                    type = "prob",
                    predict.all = FALSE,
                    proximity = FALSE))
  pred_5_mdl = as_factor(ifelse(mdlPredictprob$top25 > 0.5, "top25", "bottom75"))
  print(paste("F-1 Score at a .5 threshold:", F1_Score(y_pred = pred_5_mdl, y_true = as_factor(tune$danceability), positive = "top25")))
}
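
For intuition on what F1_Score reports: F1 is the harmonic mean of precision and recall. A minimal base-R sketch computing it from raw confusion-matrix counts, using the initial model's tune-set counts as an example (the result need not match f1score() to the last digit, since that function thresholds the vote fractions rather than using the response-class predictions):

```r
# F1 as the harmonic mean of precision and recall, from raw counts.
f1_from_counts <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)   # of songs predicted top25, fraction truly top25
  recall    <- tp / (tp + fn)   # of true top25 songs, fraction we caught
  2 * precision * recall / (precision + recall)
}
f1_from_counts(tp = 939, fp = 272, fn = 747)  # roughly 0.648
```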

# function to run the model with a given number of trees and randomly sampled variables
tuneModel <- function(numTrees, mTry) {
  set.seed(3001)    
  rf = randomForest(danceability~.,      #<- Formula: response variable ~ predictors.
                    #   The period means 'use all other variables in the data'.
                    train,                #<- A data frame with the variables to be used.
                    #y = NULL,           #<- A response vector. This is unnecessary because we're specifying a response formula.
                    #subset = NULL,      #<- This is unnecessary because we're using all the rows in the training data set.
                    #xtest = NULL,       #<- Optional held-out predictors; unused since we evaluate on the tune set separately.
                    #ytest = NULL,       #<- Optional held-out response (danceability); likewise unused here.
                    ntree = numTrees,        #<- Number of trees to grow. This should not be set to too small a number, to ensure that every input row gets classified at least a few times.
                    mtry = mTry,            #<- Number of variables randomly sampled as candidates at each split. Default number for classification is sqrt(# of variables). Default number for regression is (# of variables / 3).
                    replace = TRUE,      #<- Should sampled data points be replaced.
                    #classwt = NULL,     #<- Priors of the classes. Use this if you want to specify what proportion of the data SHOULD be in each class. This is relevant if your sample data is not completely representative of the actual population 
                    #strata = NULL,      #<- Not necessary for our purpose here.
                    sampsize = 3000,      #<- Size of sample to draw each time.
                    nodesize = 5,        #<- Minimum number of data points in terminal nodes.
                    #maxnodes = NULL,    #<- Limits the number of maximum splits. 
                    importance = TRUE,   #<- Should importance of predictors be assessed?
                    #localImp = FALSE,   #<- Should casewise importance measure be computed? (Setting this to TRUE will override importance.)
                    proximity = FALSE,    #<- Should a proximity measure between rows be calculated?
                    norm.votes = TRUE,   #<- If TRUE (default), the final vote tally is expressed as fractions. If FALSE, raw vote counts are returned (useful for combining results from different runs).
                    do.trace = TRUE,     #<- If set to TRUE, give a more verbose output as randomForest is run.
                    keep.forest = TRUE,  #<- If set to FALSE, the forest will not be retained in the output object. If xtest is given, defaults to FALSE.
                    keep.inbag = TRUE)   #<- Should an n by ntree matrix be returned that keeps track of which samples are in-bag in which trees?
  return(rf)
}

rf1 = tuneModel(250, 3)
rf2 = tuneModel(750, 3)
rf3 = tuneModel(1000, 3)
rf4 = tuneModel(250, 4)
rf5 = tuneModel(500, 4)
rf6 = tuneModel(750, 4)
rf7 = tuneModel(1000, 4)
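
The seven calls above can equivalently be generated from a parameter grid; a small sketch (the Map() line is commented out since fitting all the forests is slow, and the full 4 x 2 grid includes (500, 3), which we skip above on the assumption that the initial model already covers it):

```r
# Enumerate the (ntree, mtry) combinations swept in the tuning runs.
grid <- expand.grid(ntree = c(250, 500, 750, 1000), mtry = c(3, 4))
# models <- Map(tuneModel, grid$ntree, grid$mtry)  # would fit all 8 forests
nrow(grid)  # 8 combinations
```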

Initial Model

showModelOutput(rfInit, 'Initial Model')
## [1] "Call"
## randomForest(formula = danceability ~ ., data = train, ntree = 500, 
##     mtry = mytry_tune(music_genre), replace = TRUE, sampsize = 3000, 
##     nodesize = 5, importance = TRUE, proximity = FALSE, norm.votes = TRUE, 
##     do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## [1] "Variable Importance"
##                  bottom75        top25 MeanDecreaseAccuracy MeanDecreaseGini
## popularity   0.0028082400 0.0297349585         0.0095300355        66.049461
## acousticness 0.0221714834 0.0268983771         0.0233517180        87.225350
## energy       0.0439789286 0.0198807787         0.0379645350        99.501703
## key          0.0003181391 0.0026313539         0.0008954235        52.845671
## liveness     0.0027286920 0.0067857634         0.0037413657        68.515600
## loudness     0.0194395663 0.0054037370         0.0159367254        67.903111
## mode         0.0007975935 0.0002112042         0.0006512238         7.134999
## speechiness  0.0265565318 0.0458372893         0.0313685228       127.479963
## tempo        0.0084609073 0.0398508084         0.0162957907       107.873982
## valence      0.0126543110 0.0463523295         0.0210648290       115.383466
## genre        0.0346956844 0.1146131197         0.0546443362       176.929165

## Confusion Matrix and Statistics
## 
##           Actual
## Predicted  bottom75 top25
##   bottom75     4795   747
##   top25         272   939
##                                           
##                Accuracy : 0.8491          
##                  95% CI : (0.8403, 0.8576)
##     No Information Rate : 0.7503          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5555          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5569          
##             Specificity : 0.9463          
##          Pos Pred Value : 0.7754          
##          Neg Pred Value : 0.8652          
##              Prevalence : 0.2497          
##          Detection Rate : 0.1390          
##    Detection Prevalence : 0.1793          
##       Balanced Accuracy : 0.7516          
##                                           
##        'Positive' Class : top25           
## 
f1score(rfInit)
## [1] "F-1 Score at a .5 threshold: 0.646855563234278"

250 Trees with 3 Randomly Sampled Variables

showModelOutput(rf1, '250 Trees with 3 mtry')
## [1] "Call"
## randomForest(formula = danceability ~ ., data = train, ntree = numTrees, 
##     mtry = mTry, replace = TRUE, sampsize = 3000, nodesize = 5, 
##     importance = TRUE, proximity = FALSE, norm.votes = TRUE, 
##     do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## [1] "Variable Importance"
##                  bottom75        top25 MeanDecreaseAccuracy MeanDecreaseGini
## popularity   0.0028034922 0.0286603324         0.0092590448        65.514197
## acousticness 0.0224116992 0.0252892875         0.0231300479        85.598486
## energy       0.0466930599 0.0195978548         0.0399319590       101.485561
## key          0.0002951075 0.0030493668         0.0009823990        53.015681
## liveness     0.0026328644 0.0065500539         0.0036106661        68.483126
## loudness     0.0202080817 0.0046957784         0.0163376597        68.494976
## mode         0.0008123317 0.0003138149         0.0006879538         7.328699
## speechiness  0.0275490084 0.0453778688         0.0319990136       128.621822
## tempo        0.0083738784 0.0401332938         0.0162998785       107.399952
## valence      0.0127894155 0.0469588975         0.0213172223       116.396327
## genre        0.0348725311 0.1138615015         0.0545869002       176.328575

## Confusion Matrix and Statistics
## 
##           Actual
## Predicted  bottom75 top25
##   bottom75     4787   748
##   top25         280   938
##                                          
##                Accuracy : 0.8478         
##                  95% CI : (0.839, 0.8563)
##     No Information Rate : 0.7503         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5522         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.5563         
##             Specificity : 0.9447         
##          Pos Pred Value : 0.7701         
##          Neg Pred Value : 0.8649         
##              Prevalence : 0.2497         
##          Detection Rate : 0.1389         
##    Detection Prevalence : 0.1804         
##       Balanced Accuracy : 0.7505         
##                                          
##        'Positive' Class : top25          
## 
f1score(rf1)
## [1] "F-1 Score at a .5 threshold: 0.645027624309392"

750 Trees with 3 Randomly Sampled Variables

showModelOutput(rf2, '750 Trees with 3 mtry')
## [1] "Call"
## randomForest(formula = danceability ~ ., data = train, ntree = numTrees, 
##     mtry = mTry, replace = TRUE, sampsize = 3000, nodesize = 5, 
##     importance = TRUE, proximity = FALSE, norm.votes = TRUE, 
##     do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## [1] "Variable Importance"
##                  bottom75        top25 MeanDecreaseAccuracy MeanDecreaseGini
## popularity   0.0026352312 0.0299939521         0.0094631353         65.92190
## acousticness 0.0213912856 0.0274745531         0.0229099082         87.21707
## energy       0.0425295248 0.0201312823         0.0369400841         98.92768
## key          0.0003559939 0.0025749304         0.0009096668         52.47710
## liveness     0.0028145191 0.0065896949         0.0037569159         68.65797
## loudness     0.0193361446 0.0052561346         0.0158229870         68.41699
## mode         0.0008073480 0.0002389608         0.0006654871          7.26443
## speechiness  0.0260895894 0.0457414228         0.0309939135        126.99184
## tempo        0.0085384197 0.0403734133         0.0164834476        108.57072
## valence      0.0124648080 0.0457061269         0.0207604290        115.19557
## genre        0.0343391656 0.1160565503         0.0547346378        177.76502

## Confusion Matrix and Statistics
## 
##           Actual
## Predicted  bottom75 top25
##   bottom75     4799   750
##   top25         268   936
##                                           
##                Accuracy : 0.8493          
##                  95% CI : (0.8405, 0.8577)
##     No Information Rate : 0.7503          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5552          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5552          
##             Specificity : 0.9471          
##          Pos Pred Value : 0.7774          
##          Neg Pred Value : 0.8648          
##              Prevalence : 0.2497          
##          Detection Rate : 0.1386          
##    Detection Prevalence : 0.1783          
##       Balanced Accuracy : 0.7511          
##                                           
##        'Positive' Class : top25           
## 
f1score(rf2)
## [1] "F-1 Score at a .5 threshold: 0.647282796815507"

1000 Trees with 3 Randomly Sampled Variables

showModelOutput(rf3, '1000 Trees with 3 mtry')
## [1] "Call"
## randomForest(formula = danceability ~ ., data = train, ntree = numTrees, 
##     mtry = mTry, replace = TRUE, sampsize = 3000, nodesize = 5, 
##     importance = TRUE, proximity = FALSE, norm.votes = TRUE, 
##     do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## [1] "Variable Importance"
##                  bottom75        top25 MeanDecreaseAccuracy MeanDecreaseGini
## popularity   0.0027166174 0.0298640409         0.0094931241        66.123537
## acousticness 0.0206330249 0.0273184314         0.0223022812        86.062525
## energy       0.0418746953 0.0204750002         0.0365341273        98.908287
## key          0.0003613395 0.0023986769         0.0008696947        52.335094
## liveness     0.0027710398 0.0066395753         0.0037367792        68.758263
## loudness     0.0190288220 0.0055507681         0.0156650561        68.474415
## mode         0.0007790151 0.0002131674         0.0006377681         7.280667
## speechiness  0.0262426278 0.0458532197         0.0311377663       127.039464
## tempo        0.0085760195 0.0404317843         0.0165273109       108.911625
## valence      0.0124620964 0.0457422752         0.0207688024       114.732622
## genre        0.0341784379 0.1162955839         0.0546763345       177.847999

## Confusion Matrix and Statistics
## 
##           Actual
## Predicted  bottom75 top25
##   bottom75     4799   747
##   top25         268   939
##                                           
##                Accuracy : 0.8497          
##                  95% CI : (0.8409, 0.8581)
##     No Information Rate : 0.7503          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5568          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5569          
##             Specificity : 0.9471          
##          Pos Pred Value : 0.7780          
##          Neg Pred Value : 0.8653          
##              Prevalence : 0.2497          
##          Detection Rate : 0.1390          
##    Detection Prevalence : 0.1787          
##       Balanced Accuracy : 0.7520          
##                                           
##        'Positive' Class : top25           
## 
f1score(rf3)
## [1] "F-1 Score at a .5 threshold: 0.649153128240581"

250 Trees with 4 Randomly Sampled Variables

showModelOutput(rf4, '250 Trees with 4 mtry')
## [1] "Call"
## randomForest(formula = danceability ~ ., data = train, ntree = numTrees, 
##     mtry = mTry, replace = TRUE, sampsize = 3000, nodesize = 5, 
##     importance = TRUE, proximity = FALSE, norm.votes = TRUE, 
##     do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## [1] "Variable Importance"
##                  bottom75        top25 MeanDecreaseAccuracy MeanDecreaseGini
## popularity   0.0021069248 0.0301827390         0.0091114684        63.460726
## acousticness 0.0215261149 0.0244784442         0.0222617471        84.910277
## energy       0.0441774110 0.0202964868         0.0382161685       102.338821
## key          0.0003943582 0.0022576506         0.0008592551        52.876098
## liveness     0.0026858278 0.0068944839         0.0037366359        70.569347
## loudness     0.0188357309 0.0043609772         0.0152247219        64.368892
## mode         0.0006677411 0.0001788687         0.0005457108         6.770831
## speechiness  0.0270498898 0.0448028211         0.0314799100       124.934020
## tempo        0.0092647960 0.0428661158         0.0176491298       114.613336
## valence      0.0143455969 0.0493261326         0.0230750192       120.600984
## genre        0.0377116565 0.1227544696         0.0589338879       188.437400

## Confusion Matrix and Statistics
## 
##           Actual
## Predicted  bottom75 top25
##   bottom75     4791   729
##   top25         276   957
##                                           
##                Accuracy : 0.8512          
##                  95% CI : (0.8425, 0.8596)
##     No Information Rate : 0.7503          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5637          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5676          
##             Specificity : 0.9455          
##          Pos Pred Value : 0.7762          
##          Neg Pred Value : 0.8679          
##              Prevalence : 0.2497          
##          Detection Rate : 0.1417          
##    Detection Prevalence : 0.1826          
##       Balanced Accuracy : 0.7566          
##                                           
##        'Positive' Class : top25           
## 
f1score(rf4)
## [1] "F-1 Score at a .5 threshold: 0.65430827325781"

500 Trees with 4 Randomly Sampled Variables

showModelOutput(rf5, '500 Trees with 4 mtry')
## [1] "Call"
## randomForest(formula = danceability ~ ., data = train, ntree = numTrees, 
##     mtry = mTry, replace = TRUE, sampsize = 3000, nodesize = 5, 
##     importance = TRUE, proximity = FALSE, norm.votes = TRUE, 
##     do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## [1] "Variable Importance"
##                  bottom75       top25 MeanDecreaseAccuracy MeanDecreaseGini
## popularity   0.0022303421 0.029081292         0.0089301939        62.970153
## acousticness 0.0209604834 0.025464504         0.0220844360        84.177256
## energy       0.0435635985 0.020111480         0.0377090409       100.916646
## key          0.0003187114 0.002511198         0.0008658181        53.691086
## liveness     0.0027421037 0.007008868         0.0038068898        70.357032
## loudness     0.0186393241 0.004205398         0.0150374037        64.791202
## mode         0.0006633483 0.000144631         0.0005338648         6.966023
## speechiness  0.0266254183 0.045246869         0.0312725506       124.242029
## tempo        0.0092047879 0.044082530         0.0179084027       116.091403
## valence      0.0142293411 0.049178236         0.0229515894       119.903127
## genre        0.0381948041 0.122052868         0.0591235032       188.648798

## Confusion Matrix and Statistics
## 
##           Actual
## Predicted  bottom75 top25
##   bottom75     4795   734
##   top25         272   952
##                                           
##                Accuracy : 0.851           
##                  95% CI : (0.8423, 0.8594)
##     No Information Rate : 0.7503          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5624          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5647          
##             Specificity : 0.9463          
##          Pos Pred Value : 0.7778          
##          Neg Pred Value : 0.8672          
##              Prevalence : 0.2497          
##          Detection Rate : 0.1410          
##    Detection Prevalence : 0.1813          
##       Balanced Accuracy : 0.7555          
##                                           
##        'Positive' Class : top25           
## 
f1score(rf5)
## [1] "F-1 Score at a .5 threshold: 0.653819683413627"

750 Trees with 4 Randomly Sampled Variables

showModelOutput(rf6, '750 Trees with 4 mtry')
## [1] "Call"
## randomForest(formula = danceability ~ ., data = train, ntree = numTrees, 
##     mtry = mTry, replace = TRUE, sampsize = 3000, nodesize = 5, 
##     importance = TRUE, proximity = FALSE, norm.votes = TRUE, 
##     do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## [1] "Variable Importance"
##                  bottom75        top25 MeanDecreaseAccuracy MeanDecreaseGini
## popularity   0.0023475280 0.0290621781         0.0090131985        63.387407
## acousticness 0.0210153213 0.0260702110         0.0222762663        84.024737
## energy       0.0434541648 0.0209415764         0.0378348005       100.754690
## key          0.0003259417 0.0024115271         0.0008464242        53.950837
## liveness     0.0027709160 0.0068952264         0.0038000770        70.360076
## loudness     0.0188266921 0.0043728043         0.0152188948        65.291249
## mode         0.0006892206 0.0001863619         0.0005636816         6.830315
## speechiness  0.0266887768 0.0456021366         0.0314095277       124.756277
## tempo        0.0092703640 0.0439295714         0.0179196095       115.695801
## valence      0.0139625578 0.0492097970         0.0227590552       120.161151
## genre        0.0377956426 0.1213235163         0.0586417271       187.920914

## Confusion Matrix and Statistics
## 
##           Actual
## Predicted  bottom75 top25
##   bottom75     4796   734
##   top25         271   952
##                                           
##                Accuracy : 0.8512          
##                  95% CI : (0.8425, 0.8596)
##     No Information Rate : 0.7503          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5627          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5647          
##             Specificity : 0.9465          
##          Pos Pred Value : 0.7784          
##          Neg Pred Value : 0.8673          
##              Prevalence : 0.2497          
##          Detection Rate : 0.1410          
##    Detection Prevalence : 0.1811          
##       Balanced Accuracy : 0.7556          
##                                           
##        'Positive' Class : top25           
## 
f1score(rf6)
## [1] "F-1 Score at a .5 threshold: 0.654282765737874"

1000 Trees with 4 Randomly Sampled Variables

showModelOutput(rf7, '1000 Trees with 4 mtry')
## [1] "Call"
## randomForest(formula = danceability ~ ., data = train, ntree = numTrees, 
##     mtry = mTry, replace = TRUE, sampsize = 3000, nodesize = 5, 
##     importance = TRUE, proximity = FALSE, norm.votes = TRUE, 
##     do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## [1] "Variable Importance"
##                  bottom75        top25 MeanDecreaseAccuracy MeanDecreaseGini
## popularity   0.0023803830 0.0294547451         0.0091363689        64.780576
## acousticness 0.0209007604 0.0263255993         0.0222551631        84.584563
## energy       0.0432719177 0.0206952529         0.0376365048       100.473251
## key          0.0003379178 0.0023090811         0.0008299471        53.976000
## liveness     0.0027112367 0.0067801025         0.0037266605        69.982193
## loudness     0.0189376095 0.0040407258         0.0152192093        65.452419
## mode         0.0006551678 0.0001891369         0.0005387720         6.811971
## speechiness  0.0264732573 0.0450157173         0.0311020025       124.169033
## tempo        0.0092899192 0.0440812127         0.0179728595       115.550913
## valence      0.0141687552 0.0489990978         0.0228617403       119.912000
## genre        0.0376795038 0.1204929300         0.0583484363       186.637727

## Confusion Matrix and Statistics
## 
##           Actual
## Predicted  bottom75 top25
##   bottom75     4798   728
##   top25         269   958
##                                           
##                Accuracy : 0.8524          
##                  95% CI : (0.8437, 0.8607)
##     No Information Rate : 0.7503          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5666          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5682          
##             Specificity : 0.9469          
##          Pos Pred Value : 0.7808          
##          Neg Pred Value : 0.8683          
##              Prevalence : 0.2497          
##          Detection Rate : 0.1419          
##    Detection Prevalence : 0.1817          
##       Balanced Accuracy : 0.7576          
##                                           
##        'Positive' Class : top25           
## 
f1score(rf7)
## [1] "F-1 Score at a .5 threshold: 0.657731958762887"

Testing our Final Models

KNN

KNN_test = test[-c(3,5,12)]
test1h = one_hot(as.data.table(KNN_test),
                 cols = "auto",
                 sparsifyNAs = TRUE,
                 naCols = TRUE,
                 dropCols = TRUE,
                 dropUnusedLevels = TRUE)
Music_25NN_Final = knn(train = train1h,
                test = test1h,
                cl = train$danceability,
                k = 25,
                use.all = TRUE,
                prob = TRUE)
confusionMatrix(as.factor(Music_25NN_Final),
                as.factor(test$danceability),
                dnn = c("Prediction", "Actual"),
                mode = "sens_spec",
                positive = "top25")
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction bottom75 top25
##   bottom75     4665   917
##   top25         402   768
##                                         
##                Accuracy : 0.8047        
##                  95% CI : (0.795, 0.814)
##     No Information Rate : 0.7504        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.4192        
##                                         
##  Mcnemar's Test P-Value : < 2.2e-16     
##                                         
##             Sensitivity : 0.4558        
##             Specificity : 0.9207        
##          Pos Pred Value : 0.6564        
##          Neg Pred Value : 0.8357        
##              Prevalence : 0.2496        
##          Detection Rate : 0.1137        
##    Detection Prevalence : 0.1733        
##       Balanced Accuracy : 0.6882        
##                                         
##        'Positive' Class : top25         
## 
#Establishing dataframe with both the prediction and probability
Music_25NN_Final_Prob = data.frame(pred = as_factor(Music_25NN_Final), prob = attr(Music_25NN_Final, "prob"))
#Adjusting so probability aligns with probability of "top25" for all observations
Music_25NN_Final_Prob$prob = ifelse(Music_25NN_Final_Prob$pred == "bottom75", 1 - Music_25NN_Final_Prob$prob, Music_25NN_Final_Prob$prob)
#Finding F1 score at .5 threshold
pred_5_KNN = as_factor(ifelse(Music_25NN_Final_Prob$prob > 0.5, "top25", "bottom75"))
print(paste("F-1 Score at a .5 threshold:",F1_Score(y_pred = pred_5_KNN, y_true = as_factor(test$danceability), positive = "top25")))
## [1] "F-1 Score at a .5 threshold: 0.536978618997546"
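
A toy illustration of the probability adjustment above: the "prob" attribute returned by class::knn holds the vote share of the *winning* class, so for songs predicted "bottom75" the top25 probability is its complement:

```r
# Hypothetical predictions and winning-class vote shares from knn().
pred <- factor(c("bottom75", "top25"))
prob <- c(0.8, 0.6)
# Flip bottom75 winners so every entry is P(top25).
p_top25 <- ifelse(pred == "bottom75", 1 - prob, prob)
p_top25  # 0.2 0.6
```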

Decision Tree

dance_prob_test = as_tibble(predict(c5_model_tune, test, type = "prob"))
dance_pred_test = predict(c5_model_tune, test, type = "class")
confusionMatrix(as.factor(dance_pred_test),
                as.factor(test$danceability),
                dnn = c("Predicted", "Actual"),
                mode = "sens_spec",
                positive = "top25")
## Confusion Matrix and Statistics
## 
##           Actual
## Predicted  bottom75 top25
##   bottom75     4654   654
##   top25         413  1031
##                                           
##                Accuracy : 0.842           
##                  95% CI : (0.8331, 0.8506)
##     No Information Rate : 0.7504          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5569          
##                                           
##  Mcnemar's Test P-Value : 2.022e-13       
##                                           
##             Sensitivity : 0.6119          
##             Specificity : 0.9185          
##          Pos Pred Value : 0.7140          
##          Neg Pred Value : 0.8768          
##              Prevalence : 0.2496          
##          Detection Rate : 0.1527          
##    Detection Prevalence : 0.2139          
##       Balanced Accuracy : 0.7652          
##                                           
##        'Positive' Class : top25           
## 
#F1 Score at .5 threshold
pred_5_test = as_factor(ifelse(dance_prob_test$top25 > 0.5, "top25", "bottom75"))
print(paste("F-1 Score at a .5 threshold:",F1_Score(y_pred = pred_5_test, y_true = as_factor(test$danceability), positive = "top25")))
## [1] "F-1 Score at a .5 threshold: 0.65899648449984"

Random Forest

rfPredict = predict(rf7,
                    test,
                    type = "response",
                    predict.all = FALSE,
                    proximity = FALSE)
rfPredictprob = as_tibble(predict(rf7,
                    test,
                    type = "prob",
                    predict.all = FALSE,
                    proximity = FALSE))
confusionMatrix(as.factor(rfPredict),
                as.factor(test$danceability),
                dnn = c("Predicted", "Actual"),
                mode = "sens_spec",
                positive = "top25")
## Confusion Matrix and Statistics
## 
##           Actual
## Predicted  bottom75 top25
##   bottom75     4738   742
##   top25         329   943
##                                         
##                Accuracy : 0.8414        
##                  95% CI : (0.8324, 0.85)
##     No Information Rate : 0.7504        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.5388        
##                                         
##  Mcnemar's Test P-Value : < 2.2e-16     
##                                         
##             Sensitivity : 0.5596        
##             Specificity : 0.9351        
##          Pos Pred Value : 0.7414        
##          Neg Pred Value : 0.8646        
##              Prevalence : 0.2496        
##          Detection Rate : 0.1397        
##    Detection Prevalence : 0.1884        
##       Balanced Accuracy : 0.7474        
##                                         
##        'Positive' Class : top25         
## 
pred_5_rf = as_factor(ifelse(rfPredictprob$top25 > 0.5, "top25", "bottom75"))
print(paste("F-1 Score at a .5 threshold:",F1_Score(y_pred = pred_5_rf, y_true = as_factor(test$danceability), positive = "top25")))
## [1] "F-1 Score at a .5 threshold: 0.637347767253045"

Overall Findings

On the test set, the tuned decision tree produced the strongest balance of our two target metrics, with an F1 score of 0.659 and a specificity of 0.9185. The tuned random forest traded some F1 (0.637) for the highest specificity of the three models (0.9351), while the KNN model trailed both with an F1 of 0.537 and a specificity of 0.9207.

Next Steps

Moving forward, we recommend that the bars use our single decision tree model if they most value correctly identifying "dancy" songs, and the random forest if they want to minimize the chance of playing a buzzkill, since its higher specificity makes it the more conservative choice. In the future, it may be productive to recast the problem as regression rather than classification, because danceability begins as a continuous variable.
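
A hedged sketch of that regression variant (train_reg is a hypothetical copy of the training data that keeps the continuous danceability scores; the randomForest fit is shown commented out since it is not run here):

```r
# Regression forest on the raw scores (sketch, not run here):
#   rf_reg <- randomForest(danceability ~ ., data = train_reg, ntree = 500)
#   scores <- predict(rf_reg, tune)
# Predicted scores could then be binned at the training 75th percentile:
to_top25 <- function(scores, cutoff) ifelse(scores >= cutoff, "top25", "bottom75")
to_top25(c(0.45, 0.82), cutoff = 0.7)  # "bottom75" "top25"
```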